Camera/image-based localization is important for many emerging
applications such as augmented reality (AR), mixed reality, robotics, and 
self-driving. Camera localization is the problem of estimating both camera 
position and orientation with respect to an object. Use cases for camera 
localization depend on two key factors: accuracy and speed (latency). 
Therefore, this paper proposes Depth-DensePose, an efficient deep learning 
model for 6-degrees-of-freedom (6-DoF) camera-based localization. The 
Depth-DensePose utilizes the advantages of both DenseNets and adapted 
depthwise separable convolution (DS-Conv) to build a deeper and more 
efficient network. The proposed model consists of iterative depth-dense 
blocks. Each depth-dense block contains two adapted DS-Conv layers with
kernel sizes 3 and 5, which help retain both low-level and
high-level features. We evaluate the proposed Depth-DensePose on the
Cambridge Landmarks dataset, where it outperforms related deep learning
models for camera-based localization. Furthermore, extensive experiments
show that the adapted DS-Conv is more efficient than the standard
convolution, especially in terms of memory and processing time, which are
important for real-time and mobile applications.

1. INTRODUCTION


Visual (camera) localization is an active research field due to the rapid development and
technological advances in computer vision and mobile computing. Visual localization estimates the pose
(position and orientation) of a camera. It is an important step in emerging applications such
as augmented reality (AR), mixed reality, robotics, and self-driving [1].

Generally, there are two categories of traditional visual localization methods: image-based and
structure-based methods [2], [3]. In image-based methods, the 6-degrees-of-freedom (6-DoF) camera pose
is estimated by retrieving the nearest image with a known ground-truth pose from a reference database. Matches
between images are found by searching through the global descriptor space [4]. This descriptor may be a
hand-crafted feature (such as scale-invariant feature transform (SIFT) [5], dominant SIFT [6], speeded-up
robust features (SURF) [7], or oriented FAST and rotated BRIEF (ORB) [8]) or a learned feature (such as
SuperPoint [9], ASLFeat [10], or SeqNet [11]). Structure-based methods predict the 6-DoF camera pose by
matching local descriptors of 2D points in the query image against a reconstructed 3D point cloud of the whole
scene. Like global descriptors, local descriptors may be hand-crafted or learned. Although
hand-crafted feature methods are accurate and robust in many situations (especially in outdoor
environments), indoor feature-based localization is still an issue, since indoor scenes often have less
texture, repetitive structures, and insufficient local/global features for matching [12], [13].

Recently, convolutional neural networks (CNNs) have achieved notable success in solving many computer
vision problems [14]-[17]. Accordingly, CNNs have been used to estimate the camera pose for an input image.
The camera localization problem was defined as a regression problem that estimates the absolute pose using
a CNN [18]. Kendall et al. [19] developed PoseNet, a deep learning model that regresses the 6-DoF
camera pose directly from the input image. Following the success of PoseNet, many deep learning models
were developed for camera pose estimation.

In this paper, we propose Depth-DensePose, an efficient end-to-end deep learning model for
camera/image-based localization. The architecture of Depth-DensePose has several advantages: i) it adopts the
DenseNets [20] architecture, which leads to a deeper model, faster training, and more
accurate results; ii) it uses a depthwise separable convolution (DS-Conv) with two kernel sizes, 3 and 5,
instead of the standard convolution. The kernel of size 3 is useful to obtain the low-level features, while the
kernel of size 5 is useful to obtain the high-level features. In addition, the DS-Conv requires less memory
than the standard convolution with similar performance [21]; and iii) it reduces the average pooling layers by
adopting a convolution with stride=2, which preserves the resolution and requires less graphics processing
unit (GPU) memory.

We evaluate the proposed Depth-DensePose on the Cambridge Landmarks dataset [19], [22]. The
results of these experiments show that the Depth-DensePose model reduces the mean squared error
(MSE) of the camera position to 0.74 m and of the orientation to 2.14 degrees over the entire Cambridge dataset. From
these results, we conclude that Depth-DensePose outperforms the related models for camera-based
localization tasks. Further experiments show that the adapted DS-Conv is more
efficient than the standard convolution: Depth-DensePose using DS-Conv requires about 40% less memory
and 20% less training time than with the standard convolution. These results are important for real-time and
mobile applications. It is worth mentioning that the total number of layers of Depth-DensePose using DS-Conv
is larger, because the DS-Conv uses two different kernel sizes.

The rest of this paper is organized as follows: section 2 presents important related work. Section 3 discusses the
proposed system and method. Section 4 discusses the Depth-DensePose implementation, experiments, and
results. Finally, section 5 presents the conclusion.


2. RELATED WORK 

In traditional methods, hand-crafted features extracted from the input image are used to search for the
best pose (localization) that matches the stored features. Deep learning methods directly learn good
representations (encodings) of the input images at several levels of detail. The representation could be
unknown features, depth, or even 3D motion between two images for localization problems. Generally, three
popular architectures of deep neural networks (DNNs) have been applied to solve the localization
problem: the convolutional neural network (CNN), the encoder-decoder network, and the
recurrent neural network (RNN) [23].

Kendall et al. [19] suggested using a modified version of GoogLeNet [24], a well-known DNN architecture
for classification, as a backbone for a pose regression network. The suggested
architecture is named PoseNet [19]. GoogLeNet consists of six inception modules plus two intermediate
classifiers that form a total of 22 layers. Each inception module consists of a stack of 3x3 filters, 5x5
filters, and a max-pooling layer, which are useful to build a powerful model.

However, the accuracy of PoseNet is lower than that of some traditional methods.
This is due to the large size (2048D) of the feature vector fed to the regressor layer. Increasing the
size of the feature vector causes overfitting, which leads to increased errors
on test images that differ from the training images. Generalization is another important problem that
is not fully addressed in the original PoseNet model.

Following the success of PoseNet in encoding the localization problem as a regression task, many deep
learning models were developed as enhancements of PoseNet. Walch et al. [25] suggested using long
short-term memory (LSTM) to reduce the feature vector size, which addresses the overfitting problem. In this
architecture, four LSTM units are used to reduce the size of the feature vector. LSTM is a type of RNN that
consists of hidden states that collect or omit relevant contextual features. Recently, combining CNNs with
LSTM has proven useful for solving some computer vision problems [25].

Hourglass-Pose [26] is another architecture, an encoder-decoder model based on the ResNet34 [27].
The ResNet consists of four residual blocks, each of which consists of convolution, activation, and
batch normalization layers. There are direct (skip) connections between the corresponding encoder and
decoder blocks. These connections help preserve low-level details and help to solve the vanishing
gradient problem.

In SVS-Pose [28], the VGG16 model [29] was utilized instead of the GoogLeNet model. SVS-Pose
uses 3x3 filters throughout the network. VGG16 contains three fully connected (FC) layers after
the convolutional layers. The VGG16 model was split after the first FC layer into two branches, one to estimate
the camera position and the other to estimate the orientation. Two additional FC layers at the
end estimate the final position and orientation. Instead of using a single shared localizer model,
BranchNet [10] cuts the PoseNet architecture after the 5th inception module and uses that part as a shared encoder.
The remaining layers are duplicated to form two branches that estimate position and orientation separately.

Instead of returning a single estimated camera pose for the input image, MD-PoseNet returns
multiple estimated values. The main idea of MD-PoseNet is that the network returns the distribution of all
potential camera poses instead of only the most likely camera pose [30]. Beyond single-image localization,
some works have extended PoseNet to the time domain to address temporal localization. VidLoc [31]
performs localization for short video clips using a bidirectional recurrent neural network (BLSTM) [32].
VLocNet [33] and VLocNet++ [18] proposed networks that jointly learn pose regression and visual odometry.
KFNet [34] uses Kalman filtering [35] to improve temporal
localization for online cameras.

In deep learning research, increasing the number of layers tends to yield a more accurate model [36],
[37]. Nevertheless, a deeper model may suffer from exploding/vanishing gradient problems that negatively affect the
training results. To address these issues, Huang et al. [20] developed a densely connected network named
DenseNets. In the DenseNets architecture, there are dense connections between consecutive blocks,
which address the exploding/vanishing problem and produce more accurate results. Moreover, dense
connections between blocks help to overcome the overfitting problem that occurs when training on a
smaller dataset. DenseNets achieve high performance on many computer vision problems [37]-[41].


3. THE PROPOSED SYSTEM AND METHOD 

Motivated by the success of DenseNets in image classification and segmentation tasks [36], [37], we
use its structure to build an efficient Depth-DensePose model for image pose regression. However, the use
of DenseNets requires taking the following challenges into account: i) the standard DenseNets [20] was
initially proposed for image classification tasks and not for regression tasks; ii) the structure of DenseNets
consists of many max-pooling layers that decrease the resolution of features; and iii) the
complexity of DenseNets is very high and requires a large amount of GPU memory. This limits some important
factors that are required to achieve high performance, such as image size, number of layers, kernel size
(usually 1x1 and 3x3), and training dataset size [42].

Based on the above challenges, we propose an efficient Depth-DensePose model for camera-based
localization. The Depth-DensePose structure has the following advantages:

- It adopts the DenseNets [20], [37] architecture, which leads to a deeper model, faster training, and
more accurate results.

- It uses the depthwise separable convolution (DS-Conv) with two kernel sizes, 3 and 5, instead of the
standard convolution. The kernel of size 3 is useful to obtain the low-level features, while the kernel of
size 5 is useful to obtain the high-level features. In addition, the DS-Conv requires less memory than the
standard convolution with similar performance [21].

- It reduces the average pooling layers by adopting a convolution with stride=2, which preserves the
resolution and requires less GPU memory.

As shown in Figure 1, DS-Conv contains a stack of depthwise and pointwise convolutions. In the
depthwise step, the convolution operation is applied separately to each input channel. After that, a pointwise
(1x1) convolution is applied to merge the output channels from the depthwise convolution. The DS-Conv
considerably decreases the memory consumption and the computational complexity. Unlike standard
convolution, the spatial convolution is not performed across all channels. That means the number of
connections is smaller and the model is lighter. In this paper, we adapted the DS-Conv with two kernel sizes, 3
and 5. The kernel of size 3 is useful to obtain the low-level features, while the kernel of size 5 is useful to
obtain the high-level features.
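
The saving can be made concrete by counting weights. The sketch below is illustrative only: the function
names and the channel counts (64 input and 64 output channels) are assumptions, not values taken from the
model, and biases are ignored. A standard k x k convolution has k*k*C_in*C_out weights, whereas a depthwise
k x k convolution followed by a 1x1 pointwise convolution has k*k*C_in + C_in*C_out weights.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def ds_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical channel counts, chosen only for illustration.
C_IN, C_OUT = 64, 64
for k in (3, 5):
    std = standard_conv_params(C_IN, C_OUT, k)
    ds = ds_conv_params(C_IN, C_OUT, k)
    print(f"k={k}: standard={std}, DS-Conv={ds}, ratio={ds / std:.3f}")
```

The ratio of the two counts is 1/C_out + 1/k^2, so under these assumptions the relative saving grows with the
kernel size, which is why a 5x5 kernel is much cheaper in its separable form.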

The Depth-DensePose model consists of iterative depth-dense blocks, as shown in Figure 2 (in the Appendix).
Each depth-dense block consists of two DS-Conv layers with different kernel sizes (namely, 3 and 5). There are
links between the depth-dense blocks, which maintain the resolution and enhance the training results [2].

Figure 1. The proposed depthwise separable convolution with two kernel sizes (3 and 5)


After each convolution, batch normalization (BN) and activation layers are applied. BN is
useful to create wider and deeper models. We use the rectified linear unit (ReLU) as the activation
layer. The ReLU is applied to optimize the output channels and enhance the performance [43]. After each
depth-dense block, a transition block is implemented to decrease the feature vector size. Each transition block
consists of a 1x1 convolution layer followed by BN, ReLU, and finally a convolution layer with stride 2
instead of the max-pooling layer. Like PoseNet, the proposed structure has three regressor blocks, which are
applied after transition layers 1 and 2 and after the final depth-dense block 4. Each regressor block consists of: i)
an average pooling layer with an appropriate kernel size, as shown in Figure 2; ii) a 1x1 convolution with 128
kernels for dimension reduction, followed by a Flatten layer, which converts the two-dimensional
input into a one-dimensional vector; iii) an FC layer with 512 outputs followed by a ReLU activation
layer; and iv) a fully connected regressor with 7 output values that represent the estimated pose.
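
The text does not spell out how the 7 output values are laid out. A minimal sketch, assuming the PoseNet
convention of a 3D position followed by a 4D orientation quaternion, is given below. The ordering of the
values and the helper names split_pose and rotation_error_deg are assumptions made here for illustration.

```python
import math


def split_pose(pose):
    """Split a 7-value pose into a position and a unit quaternion.

    Assumed layout: [x, y, z, q0, q1, q2, q3].
    """
    if len(pose) != 7:
        raise ValueError("expected 7 pose values")
    position = tuple(pose[:3])
    quat = pose[3:]
    norm = math.sqrt(sum(q * q for q in quat))
    if norm == 0.0:
        raise ValueError("quaternion has zero norm")
    # A regressor output is not unit length in general, so normalize it.
    return position, tuple(q / norm for q in quat)


def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


position, quat = split_pose([1.0, 2.0, 3.0, 2.0, 0.0, 0.0, 0.0])
print(position, quat)
```

Normalizing the predicted quaternion before comparing it with the ground truth keeps the angular error
well defined even when the raw network output has an arbitrary scale.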

The depth of the proposed model is scalable based on the number of iterations of each depth-dense block (i.e.,
the values of [N1, N2, N3, N4]). This scalability is useful for adapting the depth to the size of the
dataset: it is better to use a deeper model when a large-scale dataset is available. Based on experiments, the
number of depth-dense blocks [6, 8, 12, 6] gives the best results for the Cambridge Landmarks [19], [22]
dataset.


4. RESULTS AND DISCUSSION 
4.1. Implementation and training scheme 

The proposed model was implemented in PyTorch [44]. PyTorch is an open-source library
developed by Facebook's AI Research lab, used for machine/deep learning and computer vision applications.
Table 1 lists the important implementation and training parameters. The implementation and training were
done on a device equipped with an NVIDIA RTX 2060 with 6 GB of memory and 16 GB of RAM.

Transfer learning (TL) is very important to boost performance [45], [46]. In machine learning (ML),
transfer learning is a methodology that focuses on storing knowledge gained while solving one problem
and applying it to a different but related problem. Therefore, we adopted the methodology of cascaded
training [37], which leads to fast convergence and good performance. In this methodology, training is
conducted in iterative steps. In the first iteration, the model is trained for 1,000 epochs. In the second iteration,
the best training weights gained in the first iteration are reused, and the model is trained for an additional 1,000
epochs, and so on. When the results no longer improve, the cascaded training stops.
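
The control flow of cascaded training can be sketched as follows. This is an illustration under stated
assumptions rather than the training code used in the experiments: train_round is a hypothetical callable
that runs one round (1,000 epochs in the scheme above) starting from the given weights and returns the new
weights with an error score, and the stopping criterion is assumed to be "no improvement in that score".

```python
def cascaded_training(train_round, max_rounds=10):
    """Repeat training rounds, warm-starting each from the best weights so far.

    train_round(init_weights) is a hypothetical callable that returns
    (weights, error); init_weights is None for the first round.
    """
    best_weights, best_error = None, float("inf")
    for _ in range(max_rounds):
        weights, error = train_round(best_weights)
        if error >= best_error:
            break  # no improvement: stop cascading
        best_weights, best_error = weights, error
    return best_weights, best_error


def toy_round(init_weights):
    """Stand-in for real training: the 'weights' are just a round index."""
    errors = [1.0, 0.8, 0.75, 0.75, 0.6]
    step = 0 if init_weights is None else init_weights + 1
    return step, errors[step]


print(cascaded_training(toy_round))  # -> (2, 0.75)
```

In the toy run the fourth round does not improve on the third, so the loop stops and returns the weights of
the third round.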


Table 1. The important implementation and training parameters


Parameters                               Values
Learning rate                            0.0001
Regressor weights (w1, w2, w3)           (0.3, 0.3, 1)
Batch sizes                              16, 32, ..., 128 (based on the GPU and size of inputs)
Depth-dense blocks [N1, N2, N3, N4]      [6, 8, 12, 6]
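
Table 1 lists one weight per regressor block but the text does not state how the three losses are
combined. A minimal sketch, assuming a weighted sum in the style of the auxiliary heads of
GoogLeNet-based models (the function name total_loss and the example loss values are hypothetical):

```python
REGRESSOR_WEIGHTS = (0.3, 0.3, 1.0)  # values from Table 1


def total_loss(regressor_losses, weights=REGRESSOR_WEIGHTS):
    """Weighted sum of the losses of the three regressor blocks (assumed form)."""
    if len(regressor_losses) != len(weights):
        raise ValueError("one loss per regressor block is required")
    return sum(w * loss for w, loss in zip(weights, regressor_losses))


# Hypothetical per-regressor losses, for illustration only.
print(total_loss([1.0, 2.0, 3.0]))
```

With these weights the final regressor dominates the objective, while the two intermediate regressors act
as auxiliary training signals.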

4.2. Experiments

Three experiments were conducted to evaluate the proposed Depth-DensePose: i) the first
experiment evaluates the effectiveness of the proposed model compared to the related work; ii) the
second experiment evaluates the effectiveness (training loss) of the DS-Conv compared to the standard
convolution; and iii) the final experiment evaluates the efficiency (memory and training time) of the
DS-Conv compared to the standard convolution.


4.2.1. Dataset 

To evaluate the effectiveness, Depth-DensePose was trained and evaluated on the Cambridge
Landmarks [19], [22] dataset. The Cambridge dataset is a large urban localization dataset consisting of 6 scenes
around Cambridge University. Each scene is split into a set of training and testing images. The
Cambridge dataset was captured using a phone camera and contains about 12,000 images labeled with their
full 6-DoF camera pose. The resolution of each image is 1920x1080 px. For complexity reduction, many
temporal frames were removed. Therefore, accurate temporal information is not available. The dataset labels
were produced using visual structure from motion (visual SfM) [47]. We applied several random scaling
operations and transformations to the dataset, which reduce the complexity and augment the dataset. The input
images are randomly resized to 256x341 and randomly cropped to 224x224.
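
The cropping step can be sketched as below. This is an illustration, not the preprocessing code used in the
experiments: the function name random_crop_box is hypothetical, and 256x341 is read here as height x width,
which is an assumption.

```python
import random


def random_crop_box(height, width, crop=224, rng=random):
    """Return (top, left, bottom, right) of a random crop x crop window."""
    if crop > height or crop > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    return top, left, top + crop, left + crop


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(random_crop_box(256, 341, rng=rng))
```

Under this reading a 224x224 window can start at any of 33 vertical and 118 horizontal offsets, so each
training image yields many slightly shifted views.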


4.2.2. Effectiveness evaluation compared to the related work 

The proposed Depth-DensePose was evaluated against the related work, specifically PoseNet [19],
DensePoseNet [19], LSTM-Pose [25], SVS-Pose [28], and VLocNet [33]. The MSE metric was used in
these experiments for both translation and rotation. Table 2 summarizes the experimental MSE of
position (in meters) and rotation (in degrees). According to these experiments, several conclusions can be drawn.
First, the proposed Depth-DensePose achieves a lower MSE and clearly outperforms the other related
approaches. Overall, Depth-DensePose reduces the average MSE of the camera position to 0.74 m and of the
orientation to 2.14 degrees for the entire Cambridge dataset. Second, Depth-DensePose is even better than
VLocNet [33], which failed to run on a large dataset. Furthermore, the scalability of Depth-DensePose
makes it work well on large as well as small datasets. The results in Figure 3 show that the proposed
architecture is beneficial for solving the camera-based localization problem.


Table 2. The Depth-DensePose MSE compared with other related models on the Cambridge Landmarks dataset


                   K. College      Old Hospital    Shop Façade     Street          St M. Church    Average
                   P       R       P       R       P       R       P       R       P       R       P       R
PoseNet            1.92 m  5.40°   2.31 m  5.38°   1.46 m  8.08°   3.67 m  6.50°   2.65 m  8.48°   2.40 m  6.77°
DensePoseNet       1.66 m  4.86°   2.57 m  5.14°   1.41 m  7.18°   2.96 m  6.00°   2.45 m  7.96°   2.21 m  6.23°
LSTM-Pose          0.99 m  3.65°   1.51 m  4.29°   1.18 m  7.44°   NA      NA      1.52 m  6.68°   1.30 m  5.52°
SVS-Pose           1.06 m  2.81°   1.50 m  4.03°   0.63 m  5.73°   NA      NA      2.11 m  8.11°   1.33 m  5.17°
VLocNet            0.84 m  1.42°   1.07 m  2.47°   0.59 m  3.57°   NA      NA      0.63 m  3.91°   0.78 m  2.84°
Depth-DensePose    0.69 m  1.23°   0.60 m  0.82°   0.38 m  3.04°   1.24 m  3.34°   0.78 m  2.26°   0.74 m  2.14°
Note: P = position, R = rotation
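
As a consistency check, the averages reported for Depth-DensePose can be recomputed from the per-scene
values in Table 2. The snippet below is only a worked arithmetic example; the dictionary and function names
are chosen here for illustration.

```python
# (position error in m, rotation error in degrees) per scene, from Table 2
DEPTH_DENSEPOSE = {
    "K. College": (0.69, 1.23),
    "Old Hospital": (0.60, 0.82),
    "Shop Facade": (0.38, 3.04),
    "Street": (1.24, 3.34),
    "St M. Church": (0.78, 2.26),
}


def scene_average(results):
    """Mean position and rotation error over all scenes, rounded to 2 digits."""
    n = len(results)
    pos = sum(p for p, _ in results.values()) / n
    rot = sum(r for _, r in results.values()) / n
    return round(pos, 2), round(rot, 2)


print(scene_average(DEPTH_DENSEPOSE))  # -> (0.74, 2.14)
```

The recomputed means agree with the 0.74 m and 2.14 degrees quoted in the text.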

Figure 3. Evaluation of the Depth-DensePose model compared to the related models


4.2.3. Effectiveness of the adopted DS-Conv compared to the standard convolution

Figures 4(a)-(f) show the results of experiments that were conducted to evaluate the effectiveness of the
adopted DS-Conv compared to the standard convolution. The proposed model was trained on the Shop
Façade scene using DS-Conv and standard convolution. Training results were monitored and compared for
different training epochs (specifically, 100, 200, 300, and 400). From the results, we conclude that the
effectiveness of the DS-Conv is comparable to that of the standard convolution. Furthermore, training with
DS-Conv achieves a lower training loss.

Figure 4. The Depth-DensePose training losses for different training epochs conducted on the Shop Façade
scene (a) first 100 epochs using standard conv, (b) second 100 epochs using standard conv, (c) third 100
epochs using standard conv, (d) first 100 epochs using DS-Conv, (e) second 100 epochs using DS-Conv,
and (f) third 100 epochs using DS-Conv


4.2.4. Efficiency of the adopted DS-Conv compared to the standard convolution 

From Table 3, several conclusions can be drawn. First, the number of (trainable and total) parameters
of Depth-DensePose using the DS-Conv is notably lower than that of Depth-DensePose using the standard
convolution. Second, Depth-DensePose using DS-Conv requires about 40% less memory than with standard
convolution. This advantage is important for real-time and mobile applications where memory size is
limited. Other experiments were conducted to estimate the training time of Depth-DensePose using DS-Conv
compared to using standard convolution. Table 4 summarizes the training time of one
epoch for different batch sizes (specifically 4, 16, and 32). Overall, the training time of Depth-DensePose
using DS-Conv is about 20% lower than with standard convolution. It is worth mentioning that the total
number of layers of Depth-DensePose using DS-Conv is larger, because the DS-Conv uses two different
kernel sizes.


Table 3. Parameter size of the Depth-DensePose using DS-Conv compared to standard convolution


                      Depth-DensePose using DS-Conv    Depth-DensePose using standard Conv
Total params          5,571,797                        9,065,685
Trainable params      5,571,797                        9,065,685
Params size (MB)      21.25                            34.58
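
The parameter sizes in Table 3 follow from the parameter counts if each weight is stored as a 32-bit float,
which is an assumption here (it is the PyTorch default, but the text does not state the precision). The short
calculation below is a worked example with names chosen for illustration.

```python
BYTES_PER_PARAM = 4  # assumption: 32-bit floating-point weights


def params_size_mb(n_params):
    """Size of the weights in MB, with 1 MB taken as 1024 * 1024 bytes."""
    return n_params * BYTES_PER_PARAM / (1024 ** 2)


ds_mb = params_size_mb(5_571_797)    # DS-Conv variant, Table 3
std_mb = params_size_mb(9_065_685)   # standard-convolution variant, Table 3
saving = 1.0 - ds_mb / std_mb
print(f"{ds_mb:.2f} MB vs {std_mb:.2f} MB, saving {saving:.1%}")
```

Under this assumption the two sizes come out at 21.25 MB and 34.58 MB, a saving of roughly 38.5%, which is
in line with the reduction of about 40% quoted in the text.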


Table 4. Training time (in seconds) of one epoch of the Depth-DensePose using DS-Conv compared to standard Conv on
the Cambridge dataset


                 Depth-DensePose using DS-Conv    Depth-DensePose using standard Conv
Batch size       4         16        32           4         16        32
K. College       70.13     38.32     28.15        86.99     42.37     35.41
Shop Façade      14.99     7.63      7.18         19.20     9.43      8.12
Street           288.63    127.89    107.47       344.38    186.57    157.47
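
The reduction in training time can be recomputed from Table 4. The sketch below averages the relative
reduction over all scenes and over the three column settings (4, 16, and 32, taken here to be batch sizes);
averaging the per-cell ratios is one reasonable choice made for illustration, since the text does not say how
the overall figure was obtained.

```python
# Seconds per epoch for the settings 4, 16, and 32 (Table 4)
DS_CONV = {
    "K. College": (70.13, 38.32, 28.15),
    "Shop Facade": (14.99, 7.63, 7.18),
    "Street": (288.63, 127.89, 107.47),
}
STANDARD = {
    "K. College": (86.99, 42.37, 35.41),
    "Shop Facade": (19.20, 9.43, 8.12),
    "Street": (344.38, 186.57, 157.47),
}


def mean_time_reduction(fast, slow):
    """Mean relative reduction in epoch time over all scenes and settings."""
    ratios = [
        1.0 - f / s
        for scene in fast
        for f, s in zip(fast[scene], slow[scene])
    ]
    return sum(ratios) / len(ratios)


print(f"{mean_time_reduction(DS_CONV, STANDARD):.1%}")
```

The mean reduction comes out at about 20%, although individual cells range from roughly 10% to just over
30%.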


5. CONCLUSION

In this paper, we proposed Depth-DensePose, an efficient deep learning model for camera-based
localization. The proposed Depth-DensePose combines the advantages of DenseNets and the adapted DS-Conv to
build a deeper and more efficient network. The results of the experiments demonstrated that
Depth-DensePose achieves a lower mean squared error and clearly outperforms the other related approaches.
Moreover, the proposed DS-Conv with two kernel sizes has a positive impact on the performance.


APPENDIX 

Figure 2. The proposed Depth-DensePose model for camera-based localization, where N1=6, N2=8, N3=12,
and N4=6